This document outlines the data structure and summarizes the information available from WCS small-scale fishery surveys conducted since September 1995. The code and functions used in this analysis are part of an R package hosted in a GitHub repository.
Data Structure
The dataset comprises small-scale fishery survey data collected since September 1995 across various landing sites or Beach Management Units (BMUs). The dataset merges two versions: one from the legacy database (September 1995 - October 2022) and another from the ongoing survey (from December 2022). The data are stored in a MongoDB database and are accessible through the peskas.kenya.data.pipeline package.
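As a rough illustration of how the validated collection might be read into R, the sketch below uses the mongolite package directly; the collection and database names are placeholders, and in practice this step is wrapped by functions of the peskas.kenya.data.pipeline package.
Code
# Illustrative only: read the validated landings from MongoDB with mongolite;
# the collection and database names below are placeholders
con <- mongolite::mongo(
  collection = "validated_landings",              # placeholder collection name
  db = "peskas",                                  # placeholder database name
  url = Sys.getenv("MONGODB_CONNECTION_STRING")   # connection string taken from the environment
)
valid_data <- con$find()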
The dataset contains the following variables:
Code
data_table <- dplyr::tibble(
  N. = seq(1, length(names(valid_data))),
  Variable = names(valid_data),
  Type = c(
    "Numeric", "Categorical", "Categorical", "Categorical", "Date",
    "Categorical", "Categorical", "Numeric", "Numeric", "Numeric",
    "Numeric", "Categorical", "Categorical", "Text", "Numeric", "Numeric"
  ),
  Description = c(
    "Data version identifier, 1 indicates legacy landings, 2 ongoing landings",
    "Consent information for the form",
    "Unique identifier for each survey submission",
    "Unique identifier for each catch in the survey",
    "Date of the fishing landing",
    "Name of the site where landing occurs (BMU)",
    "Name of the fishing ground",
    "Latitude of the fishing site",
    "Longitude of the fishing site",
    "Number of fishers involved in the catch",
    "Number of boats observed on the landing site",
    "Specific type of fishing gear used",
    "Category of fish caught",
    "Size category of the fish caught",
    "Weight of the catch in kilograms",
    "Total weight of the catch in kilograms"
  )
)

reactable::reactable(
  data_table,
  striped = TRUE,
  highlight = TRUE,
  pagination = FALSE,
  columns = list(
    "N." = reactable::colDef(maxWidth = 100),
    Description = reactable::colDef(minWidth = 240)
  ),
  theme = reactable::reactableTheme(
    borderColor = "#dfe2e5",
    stripedColor = "#f6f8fa",
    highlightColor = "#f0f5f9",
    cellPadding = "8px 12px",
    style = list(
      fontFamily = "-apple-system, BlinkMacSystemFont, Segoe UI, Helvetica, Arial, sans-serif"
    )
  )
)
Below is a breakdown of the categorical variables:
Code
# Count the occurrences of each level of the categorical variables
ranked <- valid_data %>%
  dplyr::select(dplyr::where(is.character)) %>%
  dplyr::select(-c("submission_id", "catch_id", "version", "form_consent", "fishing_ground", "size")) %>%
  tidyr::pivot_longer(dplyr::everything(), names_to = "variable", values_to = "value") %>%
  dplyr::count(variable, value)

# Fish categories accounting for less than 0.5% of records are lumped into "Other"
catch_names_minor <- ranked %>%
  dplyr::filter(variable == "fish_category") %>%
  dplyr::mutate(
    tot = sum(n),
    perc = n / tot * 100
  ) %>%
  dplyr::filter(perc < 0.5) %>%
  dplyr::pull(value)

plt <- ranked %>%
  dplyr::mutate(value = dplyr::case_when(
    variable == "fish_category" & value %in% catch_names_minor ~ "Other",
    TRUE ~ value
  )) %>%
  ggplot(aes(x = tidytext::reorder_within(value, n, variable), y = n)) +
  ggiraph::geom_bar_interactive(
    aes(fill = variable, tooltip = paste("Value:", value, "<br>Count:", n)),
    stat = "identity", alpha = 0.5
  ) +
  facet_wrap(~variable, scales = "free") +
  tidytext::scale_x_reordered() +
  coord_flip() +
  theme_minimal() +
  ggsci::scale_fill_jama() +
  theme(
    panel.grid = element_blank(),
    legend.position = "none"
  ) +
  labs(title = "Ranking of categorical variables", x = "", y = "")

ggiraph::girafe(
  ggobj = plt,
  options = list(ggiraph::opts_tooltip(
    opacity = 0.8, use_fill = TRUE, use_stroke = FALSE,
    css = "padding:5pt;font-family: Open Sans;font-size:1rem;color:white"
  ))
)
Preprocessing
The preprocessing operations were performed using the preprocess_legacy_landings and preprocess_landings functions, designed to harmonize two distinct data collection periods: the legacy database (up to October 2022) and the ongoing survey system (from December 2022). These functions:
- Remove derived variables (e.g., Month, Year, Day), as these can be regenerated when needed
- Store static information about landing sites in Google Sheets metadata tables
- Convert the data into a tidy format, where each column represents a single variable, each row corresponds to one observation, and all variables share the same unit of observation (for more information, see Tidying Data; a brief illustrative sketch follows this list)
- Prepare the data for subsequent validation checks
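The sketch below is purely illustrative of the tidying step; the wide-format column names are invented and do not correspond to the actual legacy fields.
Code
# Illustrative only: reshape a wide legacy-style record (one weight column per
# fish category) into tidy long format, one row per catch
wide <- tibble::tibble(
  submission_id = "legacy_001",
  landing_date  = as.Date("2001-05-03"),
  rabbitfish_kg = 4.2,
  octopus_kg    = 1.1
)
tidy <- wide %>%
  tidyr::pivot_longer(
    cols = dplyr::ends_with("_kg"),
    names_to = "fish_category",
    values_to = "catch_kg"
  )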
A key challenge was harmonizing gear types between the two systems. The legacy database contained multiple gear classifications that needed consolidation:
- Traditional fishing nets such as prawnseine, jarife, sardinenet, and mosquitonet were grouped into a single “other_nets” category
- Similar methods were unified (e.g., “spear” and “harpoon” became “speargun”, while “longline” and “handline” were standardized as “handline”)
- Major gear types were preserved as distinct categories: gillnet, ringnet, beachseine, and setnet
The ongoing system introduced Swahili terms that needed to be matched with these established categories. For example, “Aina_ya_zana_ya_uvuvi” (fishing gear type) entries were mapped to standardized English terms following specific rules (a recoding sketch follows this list):
- “beachseine” → “speargun” (according to KoboToolbox metadata information)
- “net” → “gillnet”
- “handline” → “setnet”
- “ringnet” → “monofilament”
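A minimal sketch of the legacy consolidation is shown below, assuming a column named gear; it is not the package's exact implementation, and the Swahili/KoboToolbox codes from the ongoing system are recoded with the same pattern using the rules listed above.
Code
# Sketch only: consolidate legacy gear labels into the harmonized categories
harmonise_gear <- function(gear) {
  dplyr::case_when(
    gear %in% c("prawnseine", "jarife", "sardinenet", "mosquitonet") ~ "other_nets",
    gear %in% c("spear", "harpoon") ~ "speargun",
    gear %in% c("longline", "handline") ~ "handline",
    TRUE ~ gear  # gillnet, ringnet, beachseine and setnet are kept as distinct categories
  )
}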
Fish categories also required standardization. The ongoing system uses Swahili terms with explicit size classifications:
- “mchanganyiko” → “rest of catch” (for mixed catches)
The preprocessing system also handles catch reporting inconsistencies. When total catch values don’t match the sum of individual catches, the system applies the following rules (sketched in code after this list):
- Uses the sum of individual catches if there’s a mismatch
- Trusts the total catch form if individual catches sum to zero but a total is reported
- Maintains original values when no discrepancy exists
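A sketch of these rules, assuming per-catch weights in catch_kg and a reported total in total_catch_kg (both column names assumed), could look like this:
Code
# Sketch only: reconcile the reported total with the sum of individual catches
reconciled <- preprocessed %>%               # `preprocessed` is a placeholder name
  dplyr::group_by(submission_id) %>%
  dplyr::mutate(sum_individual = sum(catch_kg, na.rm = TRUE)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(
    total_catch_kg = dplyr::case_when(
      sum_individual == 0 & total_catch_kg > 0 ~ total_catch_kg,  # trust the reported total
      total_catch_kg != sum_individual ~ sum_individual,          # mismatch: use the sum
      TRUE ~ total_catch_kg                                       # no discrepancy
    )
  )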
Validation
The validation process employs a systematic approach to identify and handle potential data quality issues. The process uses different alert codes to flag specific validation concerns (summarised in a small lookup sketch after the list):
- Alert code 1: Invalid dates
- Alert code 2: Anomalous number of fishers
- Alert code 3: Unusual number of boats
- Alert code 4: Outliers in individual catch records
- Alert code 5: Anomalous total catch values
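For reference, the codes can be kept as a simple named lookup (illustrative only):
Code
# Alert codes as a named lookup vector
alert_codes <- c(
  "1" = "Invalid date",
  "2" = "Anomalous number of fishers",
  "3" = "Unusual number of boats",
  "4" = "Outlier in individual catch record",
  "5" = "Anomalous total catch value"
)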
Temporal Validation
The most basic validation step ensures landing dates are not earlier than 1990 (alert code 1). Records failing this check have their dates set to NA and are flagged for review.
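A minimal sketch of this check, assuming landing_date is stored as a Date column and an alert column name of alert_number (both assumptions):
Code
# Sketch only: flag pre-1990 dates (alert code 1) and blank the invalid dates
checked <- valid_data %>%
  dplyr::mutate(
    alert_number = dplyr::if_else(landing_date < as.Date("1990-01-01"), 1, NA_real_),
    landing_date = dplyr::if_else(is.na(alert_number), landing_date, as.Date(NA))
  )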
Statistical Validation Approach
For the numerical variables (number of fishers, number of boats, and catch weights), we implemented the Median Absolute Deviation (MAD) method to identify outliers. This approach was chosen because it derives outlier thresholds from the data’s own distribution rather than from an arbitrary external cutoff, which makes it robust to the natural variability of fisheries data. Sensitivity is controlled by the parameter k: lower values of k flag more points as outliers, while higher values flag fewer.
The method is particularly suitable for fisheries data because it:
- Bases thresholds on the data’s own distribution rather than arbitrary cutoffs
- Is robust to extreme values and non-normal distributions
- Can be tuned through a sensitivity parameter (k)
Sensitivity Parameters
Different k values were chosen based on the expected variability of each metric (a sketch of the flagging rule follows this list):
- Catch data: k = 2.5 (more sensitive, due to high natural variability)
- Number of fishers: k = 3 (less sensitive, to accommodate different fishing operations)
- Number of boats: k = 3 (less sensitive, to accommodate fleet variations)
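The following is a minimal sketch of the MAD-based flagging rule, not the package's exact implementation; the column name in the usage example is assumed.
Code
# Flag values lying more than k scaled MADs from the median
flag_mad_outlier <- function(x, k = 3) {
  med <- stats::median(x, na.rm = TRUE)
  dev <- stats::mad(x, na.rm = TRUE)  # scaled MAD (constant = 1.4826)
  x < med - k * dev | x > med + k * dev
}
# e.g. flag boat counts with the less sensitive threshold:
# flag_mad_outlier(valid_data$n_boats, k = 3)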
The histogram below demonstrates how MAD identifies outliers, using number of boats as an example. The dashed line shows the outlier threshold at 13.5 boats, calculated with k = 1.5. Lower k values result in more stringent outlier detection, while higher values are more permissive.
Example Data
The histogram below uses simulated data to demonstrate the MAD method’s application. While the numbers are artificial, the principles shown here are the same ones applied to the actual fisheries data.
Code
# `fake_data` (simulated boat counts) and `upb` (the MAD upper bound for k = 1.5)
# are assumed to be created in an earlier, unshown chunk
plt <- ggplot(fake_data, aes(n_boats)) +
  theme_minimal() +
  geom_histogram(bins = 30, fill = "#52817f", color = "white", alpha = 0.5) +
  geom_vline(xintercept = upb, color = "firebrick", linetype = 2) +
  theme(panel.grid = element_blank()) +
  annotate(
    "text",
    x = upb + 0.5, y = Inf, label = "k value = 1.5",
    color = "firebrick", vjust = 2, hjust = 0
  ) +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(n.breaks = 10) +
  coord_cartesian(expand = FALSE) +
  labs(
    x = "Number of boats",
    y = "Count",
    title = "Outlier detection for number of boats using MAD"
  )

ggiraph::girafe(
  ggobj = plt,
  options = list(ggiraph::opts_tooltip(
    opacity = 0.8, use_fill = TRUE, use_stroke = FALSE,
    css = "padding:5pt;font-family: Open Sans;font-size:1rem;color:white"
  ))
)
Catch Weight Validation
Catch validation required special attention due to the hierarchical nature of the data. Because typical catch weights vary with the fishing method, the data were grouped by gear type and fish category before outlier detection, so that thresholds are specific to each gear and fish group.
The system implements:
Individual Catch Validation:
- Computes separate thresholds for each gear type and fish category combination
- Flags outliers with alert code 4
- Accounts for gear-specific catch variations

Total Catch Validation:
- Validates the total catch independently
- Uses gear type as a grouping factor
- Flags anomalies with alert code 5
- Cross-checks with the sum of individual catches
When an alert is triggered, the corresponding value is set to NA for analytical purposes while preserving the original value and alert code in the database for review.
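A rough sketch of the grouped check for individual catches (alert code 4), reusing the flag_mad_outlier() helper sketched above and assuming columns named gear, fish_category and catch_kg:
Code
# Sketch only: gear- and category-specific outlier flagging for individual catches
validated <- valid_data %>%
  dplyr::group_by(gear, fish_category) %>%
  dplyr::mutate(
    alert_number = dplyr::if_else(flag_mad_outlier(catch_kg, k = 2.5), 4, NA_real_),
    catch_kg = dplyr::if_else(is.na(alert_number), catch_kg, NA_real_)
  ) %>%
  dplyr::ungroup()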
Data Analysis
Temporal Distribution
The following plot shows survey activity across BMUs. Hover over tiles to view the number of surveys conducted in specific quarters and landing sites:
Code
# Format a date as "Q<quarter>\n<year>" for axis labels
quarter_format <- function(date) {
  year <- lubridate::year(date)
  quarter <- lubridate::quarter(date)
  paste0("Q", quarter, "\n", year)
}

# Count surveys per quarter and landing site (one row per submission)
basedata <- valid_data %>%
  dplyr::mutate(landing_date = as.Date(as.POSIXct(landing_date, origin = "1970-01-01"))) %>%
  dplyr::group_by(submission_id) %>%
  dplyr::summarise(dplyr::across(dplyr::everything(), ~ dplyr::first(.x))) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(quarter_date = lubridate::floor_date(landing_date, unit = "quarter")) %>%
  dplyr::group_by(quarter_date, landing_site) %>%
  dplyr::count() %>%
  dplyr::ungroup() %>%
  tidyr::complete(quarter_date, landing_site, fill = list(n = 0))

main_plot <- basedata %>%
  ggplot() +
  theme_bw() +
  ggiraph::geom_tile_interactive(
    aes(
      x = quarter_date, y = reorder(landing_site, n), fill = n, alpha = log(n),
      tooltip = paste(
        "Quarter:", quarter_format(quarter_date),
        "<br>Landing Site:", landing_site,
        "<br>Surveys:", n
      )
    ),
    color = "white", size = 0.3
  ) +
  scale_fill_viridis_c(direction = -1) +
  coord_cartesian(expand = FALSE) +
  theme(
    panel.grid = element_blank(),
    strip.background = element_rect(fill = "white"),
    legend.position = "bottom",
    legend.key.width = unit(1, "cm"),
    strip.text.y = element_text(angle = 0, vjust = 0.5, hjust = 0.5, face = "bold")
  ) +
  guides(alpha = "none") +
  scale_x_date(date_breaks = "3 years", labels = quarter_format, expand = c(0, 0)) +
  labs(
    title = "BMUs activity",
    subtitle = "Ranked temporal distribution of survey activities across BMUs, illustrating survey frequency over time.",
    x = "Quarter of the Year",
    fill = "Number of surveys",
    y = ""
  )

# Side panel: total number of surveys per landing site
side_plot <- basedata %>%
  dplyr::group_by(landing_site) %>%
  dplyr::summarise(total_surveys = sum(n, na.rm = TRUE)) %>%
  ggplot() +
  theme_minimal() +
  ggiraph::geom_bar_interactive(
    aes(
      y = reorder(landing_site, total_surveys), x = total_surveys,
      tooltip = paste("Landing site:", landing_site, "<br>Total surveys:", total_surveys)
    ),
    stat = "identity", fill = "#52817f", alpha = 0.5
  ) +
  ggiraph::geom_text_interactive(
    aes(
      y = reorder(landing_site, total_surveys), x = total_surveys,
      label = scales::label_number(scale = 1 / 1000, suffix = "K")(total_surveys),
      tooltip = paste("Total surveys:", total_surveys)
    ),
    hjust = -0.1, size = 3
  ) +
  labs(x = "", y = "") +
  theme(
    panel.grid = element_blank(),
    axis.text.y = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank()
  ) +
  coord_cartesian(x = c(0, 11000))

combined_plot <- cowplot::plot_grid(
  main_plot,
  side_plot + theme(plot.margin = margin(0, 10, 0, -5)),
  ncol = 2, rel_widths = c(4, 1), align = "h"
)

ggiraph::girafe(
  ggobj = combined_plot,
  options = list(ggiraph::opts_tooltip(
    opacity = 0.8, use_fill = TRUE, use_stroke = FALSE,
    css = "padding:5pt;font-family: Open Sans;font-size:1rem;color:white"
  ))
)
Fishery Metrics
This section presents key metrics used to monitor and assess fishery dynamics across different Beach Management Units (BMUs). These standardized indicators allow for comparisons across different areas and time periods while accounting for sampling effort and spatial variations.
The metrics are calculated at different scales:
- Sample Level: Basic metrics are first calculated for each landing record
- Monthly Aggregation: Sample-level metrics are then aggregated monthly by BMU
Note
Quality Control: BMUs with more than 75% missing observations are excluded
Point size in the plots indicates the number of samples, providing a visual indication of data reliability.
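A simplified sketch of the aggregation and the quality-control filter (metric and column names assumed):
Code
# Sketch only: drop BMUs with more than 75% missing catch observations, then
# aggregate sample-level metrics by month and BMU
monthly <- valid_data %>%
  dplyr::group_by(landing_site) %>%
  dplyr::filter(mean(is.na(catch_kg)) <= 0.75) %>%   # quality-control exclusion
  dplyr::mutate(month = lubridate::floor_date(landing_date, unit = "month")) %>%  # landing_date assumed to be a Date
  dplyr::group_by(landing_site, month) %>%
  dplyr::summarise(
    n_samples = dplyr::n(),
    mean_catch_kg = mean(catch_kg, na.rm = TRUE),
    .groups = "drop"
  )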